In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport
import sweetviz as sv
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

0. Loading & Verifying Sample Dataset¶

In [4]:
# Loading the dataset
# df = pd.read_csv(r"..\data\complaints.csv")
In [3]:
# Loading my 300k sample data
df = pd.read_parquet("../data/processed/cfpb_sample_300k.parquet")
In [5]:
print(f"Loaded {len(df):,} rows & {len(df.columns)} columns")

df.head()
Loaded 300,000 rows & 23 columns
Out[5]:
Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code ... Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID year_quarter geo region stratum sample_n
0 2012-03-14 Bank account or service Checking account Making/receiving payments, sending money None None None BANK OF AMERICA, NATIONAL ASSOCIATION ND 58503 ... 2012-03-15 Closed with relief Yes No 35052 2012Q1 ND Midwest Bank account or service|2012Q1|Midwest 4
1 2012-03-20 Bank account or service Checking account Problems caused by my funds being low None None None TCF NATIONAL BANK MN 55125 ... 2012-03-21 Closed with relief Yes No 37573 2012Q1 MN Midwest Bank account or service|2012Q1|Midwest 4
2 2012-03-22 Bank account or service Checking account Making/receiving payments, sending money None None None WELLS FARGO & COMPANY MN 55110 ... 2012-03-23 Closed without relief Yes Yes 39793 2012Q1 MN Midwest Bank account or service|2012Q1|Midwest 4
3 2012-03-07 Bank account or service Checking account Making/receiving payments, sending money None None None Synovus Bank OH 44108 ... 2012-03-16 Closed without relief Yes No 34571 2012Q1 OH Midwest Bank account or service|2012Q1|Midwest 4
4 2012-03-20 Bank account or service Checking account Problems caused by my funds being low None None None PNC Bank N.A. PA 18944 ... 2012-03-23 Closed without relief Yes Yes 37047 2012Q1 PA Northeast Bank account or service|2012Q1|Northeast 6

5 rows × 23 columns

It seems that the sample dataset loaded properly.

1. Exploratory Data Analysis (EDA)¶

In [14]:
profile = ProfileReport(
    df,
    title="CFPB 300k Sample - YData Profiling Report",
    explorative=True  # richer, but still reasonable runtime
)
In [15]:
profile.to_notebook_iframe()
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
100%|██████████| 23/23 [00:43<00:00,  1.90s/it]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
In [ ]:
# Save to HTML
profile.to_file("../reports/cfpb_300k_profile.html")
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

1.1 Key Univariate Insights¶

Key Variable Insights

Variable Top insight
Product Credit reporting (61.5%)
Issue "Incorrect information": 42% of complaints

| Company | Top 1: 26.1% share (Equifax?) | Timely response | 98.2% 'Yes' | | Date received | Right-skewed: Recent surge 2025+ |

Correlation

  • Product "Credit reporting" ↔ Issue "Incorrect info" (0.85 - High)
  • Region "South" ↔ Debt collection (0.25 - Moderate)
  • Narrative length ↔ Timeliness (-0.05 - Mild)

Therefore Credit reporting + South region --> peak complaints.

1.2 South vs. Others¶

In [9]:
# South vs rest (your >50% sample focus)
south_df = df[df['region'] == 'South'].copy()
other_df = df[df['region'] != 'South'].copy()
In [12]:
# Generate comparison report
sweet_report = sv.compare([south_df, "South"], [other_df, "Others"])
                                             |          | [  0%]   00:00 -> (? left)
In [13]:
# Notebook iframe
sweet_report.show_notebook(scale=0.9)
In [11]:
sweet_report.show_html('../reports/south_vs_others_sweetviz.html')
Report ../reports/south_vs_others_sweetviz.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

Top 5 South differentiators

Rank Feature South Signal Insight
1 Product Debt collection +15% ✅ Economic distress confirmed
2 Issue "Debt not owed" +12% Collection harassment hotspot
3 State TX/FL/GA dominate Sunbelt concentration
4 Company Regional banks ↑ Local players struggling?
5 Timeliness -1.2% (97.8% vs 99%) Ops lag detected

Key distributions

1. South HIGHER:

  • Payday loans (+8%)
  • "Late fees" issues (+6%)
  • ZIP codes: 7xxx, 3xxx (TX/FL)

2. South LOWER:

  • Student loans (-5%)
  • Credit cards (-3%)

3. Correlations:

  • South × Debt collection = 0.28 (strong).

Hypothesis Scorecard

My Expectation Sweetviz Says Verdict
Debt collection ↑ +15% ✅ STRONG
Timeliness ↓ -1.2% ✅ Mild
Payday ↑ +8% ✅ Confirmed

1.3 Key Findings¶

South = Debt collection crisis

  • +15% vs others
  • Regional ops 1.2% slower
  • TX/FL ground zero